Statistical Design and Analysis of RNA-Seq Data

Author

  • Paul L. Auer
Abstract

Next-generation sequencing technologies are quickly becoming the preferred approach for characterizing and quantifying entire genomes. Even though data produced from these technologies are proving to be the most informative of any thus far, very little attention has been paid to fundamental design aspects of data collection and analysis, namely sampling, randomization, replication, and blocking. We discuss these concepts in an RNA-Sequencing framework. Using simulations, we demonstrate the benefits of collecting replicated RNA-Sequencing data according to well-known statistical designs that partition the sources of biological and technical variation. Examples of these designs and their corresponding models are presented with the goal of testing differential expression.

Next-generation sequencing (NGS) has emerged as a revolutionary tool in genetics, genomics, and epigenomics. By increasing throughput and decreasing cost, compared to other sequencing technologies (Hayden 2009), NGS has enabled genome-wide investigations of various phenomena, including single nucleotide polymorphisms (Craig et al. 2008), epigenetic events (Park 2009), copy number variants (Alkan et al. 2009), differential expression (Bloom et al. 2009), and alternative splicing (Sultan et al. 2008). One application with demonstrated effectiveness over previous technologies (e.g., microarrays and Serial Analysis of Gene Expression (SAGE)) is called RNA-Sequencing (RNA-Seq) (Cloonan et al. 2009). RNA-Seq uses NGS technology to sequence, map, and quantify a population of transcripts (Mortazavi et al. 2008; Morozova et al. 2009). While RNA-Seq is a relatively new method, it has already provided unprecedented insights into the transcriptional complexities of a variety of organisms, including yeast (Nagalakshmi et al. 2008), mice (Mortazavi et al. 2008), Arabidopsis (Eveland et al. 2008), and humans (Sultan et al. 2008).
At present, there are three widely accepted, commercially available NGS devices (Illumina’s Genome Analyzer, Applied Biosystems’ SOLiD, and the 454 Genome Sequencer FLX) for RNA-Seq (Marioni et al. 2008; Cloonan et al. 2008; Eveland et al. 2008). Across platforms, the RNA-Seq methodology is approximately the same. Briefly, RNA is isolated from cells, fragmented at random positions, and copied into complementary DNA (cDNA). Fragments meeting a certain size specification (e.g., 200–300 bases long) are retained for amplification using Polymerase Chain Reaction (PCR). After amplification, the cDNA is sequenced using NGS; the resulting reads are aligned to a reference genome, and the number of sequencing reads mapped to each gene in the reference is tabulated. These gene counts, or Digital Gene Expression (DGE) measures, can be transformed and used to test differential expression (see Morozova et al. 2009 for a review of these technologies as applied to RNA-Seq). Although there are many steps in this experimental process that may introduce errors and biases, RNA-Seq has been hailed as the future of transcriptome research (Shendure 2008) because it potentially generates an unlimited dynamic range, provides greater sensitivity than microarrays, is able to discriminate closely homologous regions, and does not require a priori assumptions about regions of expression (Cloonan et al. 2009; Morozova et al. 2009).

As research transitions from microarrays to sequencing-based approaches, it is essential that we revisit many of the same concerns that the statistical community had at the beginning of the microarray era (Kerr and Churchill 2001a). Soon after the introduction of microarrays (Schena et al. 1995), a series of papers was published elucidating the need for proper experimental design (Kerr et al. 2000; Lee et al. 2000; Kerr and Churchill 2001a; Kerr and Churchill 2001b; Churchill 2002).
All of these papers rely heavily on the three fundamental aspects of sound experimental design formalized by R. A. Fisher (1935a) seventy years ago, namely replication, randomization, and blocking. These concepts can be understood by considering the following controlled experiment that is designed to test the effectiveness of two different diets. A sound experimental design would include many different subjects (i.e., replication) recruited from multiple weight loss centers (i.e., blocking). Each center would randomly assign its subjects to one of the two diets (i.e., randomization). Although the principles of good design are straightforward, their proper implementation often requires significant planning and statistical expertise.

To date, many NGS applications, specifically RNA-Seq, have neglected good design. While a few RNA-Seq studies have reported highly reproducible results with little technical variation (e.g., Marioni et al. 2008; Mortazavi et al. 2008), in the absence of a proper design, it is essentially impossible to partition biological variation from technical variation. When these two sources of variation are confounded, there is no way of knowing which source is driving the observed results. No amount of statistical sophistication can separate confounded factors after data have been collected. Generally, for differential expression analyses, researchers are interested in comparisons across treatment groups in the form of contrasts or pair-wise comparisons, and the designs for these analyses are usually quite simple.

The good news for NGS technologies is that certain properties of the platforms can be leveraged to ensure proper design. One such feature, available in all three NGS devices, is the capacity to bar-code.
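As an illustration of the diet experiment described above, the following minimal sketch randomizes subjects to diets separately within each center. The center names, subject IDs, and the function `randomize_within_blocks` are hypothetical and are not taken from the paper; the sketch only shows how replication, blocking, and randomization fit together.

```python
import random
from collections import defaultdict


def randomize_within_blocks(subjects_by_block, treatments, seed=None):
    """Randomly assign subjects to treatments separately within each block.

    subjects_by_block: dict mapping block name -> list of subject IDs
    treatments: list of treatment labels
    Returns a dict mapping subject ID -> (block, treatment).
    """
    rng = random.Random(seed)
    assignment = {}
    for block, subjects in subjects_by_block.items():
        shuffled = list(subjects)
        rng.shuffle(shuffled)  # randomization is carried out inside each block
        for position, subject in enumerate(shuffled):
            # Cycling through the treatments keeps group sizes balanced per block
            treatment = treatments[position % len(treatments)]
            assignment[subject] = (block, treatment)
    return assignment


# Hypothetical data: three weight loss centers (blocks), four subjects each
centers = {
    "center_A": ["A1", "A2", "A3", "A4"],
    "center_B": ["B1", "B2", "B3", "B4"],
    "center_C": ["C1", "C2", "C3", "C4"],
}
design = randomize_within_blocks(centers, ["diet_1", "diet_2"], seed=1)

# Count subjects per (center, diet) cell to confirm the design is balanced
counts = defaultdict(int)
for block, treatment in design.values():
    counts[(block, treatment)] += 1
```

With four subjects per center and two diets, every center contributes two subjects to each diet, so differences between centers are not confounded with differences between diets.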
Genomic fragments can be labeled or barcoded with sample-specific sequences that in turn allow multiple samples to be included in the same sequencing reaction (i.e., multiplexing) while maintaining, with high fidelity, sample identities downstream (Craig et al. 2008; Hamady et al. 2008; http://www3.appliedbiosystems.com/). To date, bar-coding has only been appreciated as a means to increase the number of samples per sequencing run. Yet here, we demonstrate how multiplexing can be used as a quality control feature that offers the flexibility to construct balanced and blocked designs for the purpose of testing differential expression. We anticipate that the progression from the current un-replicated, unblocked designs to more complex designs will be swift once the full offerings of NGS technologies are appreciated. Toward this end, we provide a brief review of some powerful statistical techniques for testing differential expression under a variety of designs. Although the designs that are presented are specific to RNA-Seq using the Illumina (Solexa) platform, the same statistical principles are applicable to the other NGS devices, as well as other types of comparative genetic and 'omic data.

REPLICATION

Un-replicated data: Observational studies with no biological replication are common in the RNA-Seq literature (e.g., Marioni et al. 2008). In an observational study, as opposed to a controlled experiment, the assignment of subjects to treatment groups is not decided by the investigator. In many cases, the different treatment groups consist of different tissue types. For example, in Marioni et al. (2008) messenger RNA (mRNA) was isolated from liver and kidney tissues, randomly fragmented, and sequenced using the Illumina Genome Analyzer (GA).
The Illumina technology (aka “Solexa”) relies on a flow-cell with eight lanes, or channels, and massively parallel sequencing by synthesis to simultaneously sequence millions of short DNA fragments in each of the lanes. Typically, independent samples of mRNA are loaded into different lanes of the flow-cell such that sequencing reactions occur independently between samples. For illustration purposes, consider an example with seven subjects and seven treatment groups (T1,...,T7), where each subject is randomly assigned to one treatment group, and mRNA from each subject is loaded into a different lane (Figure 1). Notice that there is no biological replication because there is only a single subject in each treatment group.

FIGURE 1. Hypothetical Illumina GA flow-cell with mRNA isolated from subjects within seven different treatment groups (T1,...,T7) and loaded into individual lanes (e.g., the mRNA from the subject within treatment group 1 is sequenced in Lane 1). As a control, a ΦX genomic sample is often loaded into Lane 5. The bacteriophage ΦX genome is known exactly, and can be used to recalibrate the quality scoring of sequencing reads from other lanes (Bentley et al. 2008).

In order to analyze data from un-replicated designs, the sampling hierarchy must be taken into account. Regardless of the design, we can define three levels of sampling at work in RNA-Seq data: subject sampling, RNA sampling, and fragment sampling. Subjects (e.g., organisms or individuals) are ideally drawn from a larger population to which results of the study may be generalized (un-replicated data consists of a single subject within each treatment group). RNA sampling occurs during the experimental procedure when RNA is isolated from the cell(s).
Finally, only certain fragmented RNAs that are sampled from the cell(s) are retained for amplification, and since the sequencing reads do not represent 100 percent of the fragments loaded into a flow-cell, fragment-level sampling is also at play. Un-replicated data consider only a single subject per treatment group. Typically either there is one subject to which every treatment is applied (e.g., in Marioni et al. (2008), liver and kidney samples were extracted from one human cadaver), or one distinct subject within each treatment group (e.g., Figure 1). In either situation, it is not possible to estimate variability within treatment group, and the analysis must proceed without any information regarding within-group variability.

TABLE 1. A 2x2 contingency table of (un-replicated) digital gene expression (DGE) measures for testing differential expression between Treatment 1 and Treatment 2 of Gene A. The cell counts $n_{ki}$ represent the DGE count for Gene A (k = 1) or the Remaining Genes (k = 2) for Treatment i, i = 1, 2. The k-th marginal row total is denoted $N_{k\cdot}$, $N_{\cdot i}$ is the marginal total for column i, and the grand total is the sum of all four cell counts.
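The excerpt shows only the caption of Table 1, so neither the table body nor the test applied to it is visible here. A standard choice for a 2x2 table with fixed margins is Fisher's exact test, and the sketch below implements its two-sided form using only the Python standard library. The function name `fisher_exact_two_sided` and the example counts are hypothetical and serve purely as illustration.

```python
import math


def _log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def fisher_exact_two_sided(n11, n12, n21, n22):
    """Two-sided Fisher's exact test p-value for the 2x2 table

        [[n11, n12],    rows:    Gene A, Remaining Genes
         [n21, n22]]    columns: Treatment 1, Treatment 2
    """
    row1 = n11 + n12                    # marginal total for Gene A
    col1 = n11 + n21                    # marginal total for Treatment 1
    total = n11 + n12 + n21 + n22       # grand total

    def log_pmf(x):
        # Hypergeometric probability that cell (1, 1) equals x, margins fixed
        return (_log_comb(col1, x)
                + _log_comb(total - col1, row1 - x)
                - _log_comb(total, row1))

    lo = max(0, row1 + col1 - total)    # smallest feasible value of cell (1, 1)
    hi = min(row1, col1)                # largest feasible value of cell (1, 1)
    observed = log_pmf(n11)

    # Sum the probabilities of all tables no more likely than the observed one;
    # the small tolerance guards against floating-point ties.
    p = sum(math.exp(lp)
            for lp in (log_pmf(x) for x in range(lo, hi + 1))
            if lp <= observed + 1e-7)
    return min(1.0, p)


# Hypothetical counts: Gene A has 120 reads under Treatment 1 and 60 under
# Treatment 2; the remaining genes account for the rest of each lane's reads.
p_value = fisher_exact_two_sided(120, 60, 999_880, 999_940)
```

Because the test conditions on the table margins, it needs no estimate of within-group variability. That makes it usable on un-replicated data, but it also means a small p-value reflects only the sampling of reads within these particular subjects and does not generalize to the populations they came from.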

Similar resources

Differential expression--the next generation and beyond.

RNA-sequencing (RNA-seq) technologies have not only pushed the boundaries of science, but also pushed the computational and analytic capacities of many laboratories. With respect to mapping and quantifying transcriptomes, RNA-seq has certainly established itself as the approach of choice. However, as the complexities of experiments continue to grow, there is still no standard practice that allo...

Power analysis and sample size estimation for RNA-Seq differential expression.

It is crucial for researchers to optimize RNA-seq experimental designs for differential expression detection. Currently, the field lacks general methods to estimate power and sample size for RNA-Seq in complex experimental designs, under the assumption of the negative binomial distribution. We simulate RNA-Seq count data based on parameters estimated from six widely different public data sets (...

Clustering of Short Read Sequences for de novo Transcriptome Assembly

Given the importance of transcriptome analysis in various biological studies and considering the vast amount of whole transcriptome sequencing data, it seems necessary to develop an algorithm to assemble transcriptome data. In this study we propose an algorithm for transcriptome assembly in the absence of a reference genome. First, the contiguous sequences are generated using de Bruijn graph with d...

The importance of study design for detecting differentially abundant features in high-throughput experiments

The use of high-throughput experiments, such as RNA-seq, to simultaneously identify differentially abundant entities across conditions has become widespread, but the systematic planning of such studies is currently hampered by the lack of general-purpose tools to do so. Here we demonstrate that there is substantial variability in performance across statistical tests, normalization ...

I-13: Transcriptome Dynamics of Human and Mouse Preimplantation Embryos Revealed by Single Cell RNA-Sequencing

Background: Mammalian preimplantation development is a complex process involving dramatic changes in the transcriptional architecture. However, it is still unclear about the crucial transcriptional network and key hub genes that regulate the proceeding of preimplantation embryos. Materials and Methods: Through single-cell RNA-sequencing (RNA-seq) of both human and mouse preimplantation embryos, ...


Publication date: 2010